High-Capacity Storage of Digital Information in DNA

ABSTRACT

A method for storage of an item of information (210) is disclosed. The method comprises encoding bytes (720) in the item of information (210) and, using a schema, representing the encoded bytes by DNA nucleotides to produce a DNA sequence (230). The DNA sequence (230) is broken into a plurality of overlapping DNA segments (240), and indexing information (250) is added to the plurality of DNA segments. Finally, the plurality of DNA segments (240) is synthesized (790) and stored (795).

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. patent application Ser. No. 14/556,213 filed on Nov. 30, 2014, which is a continuation of PCT Application No. PCT/EP2013/061300 filed on May 31, 2013, which claims the benefit of the filing date of U.S. Provisional Patent Application Ser. No. 61/654,295, filed on Jun. 1, 2012. The above-referenced applications are hereby incorporated by reference.

SEQUENCE LISTING

The instant application contains a Sequence Listing which has been submitted electronically in ASCII format and is hereby incorporated by reference in its entirety. Said ASCII copy, created on Jun. 1, 2015, is named 4091.91216USC1_SL.txt and is 1,592 bytes in size.

BACKGROUND OF THE INVENTION

Field of the Invention

The disclosure relates to a method and apparatus for the storage of digital information in DNA.

Brief Description of the Related Art

DNA has the capacity to hold vast amounts of information, readily stored for long periods in a compact form. Bancroft, C., Bowler, T., Bloom, B. & Clelland, C. T. Long-term storage of information in DNA. Science 293, 1763-1765 (2001) and Cox, J. P. L. Long-term data storage in DNA. TRENDS Biotech. 19, 247-250 (2001). The idea of using DNA as a store for digital information has existed since 1995. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583-585 (1995). Physical implementations of DNA storage have to date stored only trivial amounts of information, typically a few numbers or words of English text. Clelland, C. T., Risca, V. & Bancroft, C. Hiding messages in DNA microdots. Nature 399, 533-534 (1999); Kac, E. Genesis (1999) http://www.ekac.org/geninfo.html accessed online, 2 Apr. 2012; Wong, P. C., Wong, K.-K. & Foote, H. Organic data memory. Using the DNA approach. Comm. ACM 46, 95-98 (2003); Ailenberg, M. & Rotstein, O. D. An improved Huffman coding method for archiving text, images, and music characters in DNA. Biotechniques 47, 747-754 (2009); Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52-56 (2010). The inventors are unaware of large-scale storage and recovery of arbitrarily sized digital information encoded in physical DNA, rather than data storage on magnetic substrates or optical substrates.

Currently the synthesis of DNA is a specialized technology focused on biomedical applications. The cost of DNA synthesis has been steadily decreasing over the past decade. It is interesting to speculate at what timescale data storage in a DNA molecule, as disclosed herein, would be more cost effective than the current long-term archiving process of data storage on tape with rare but regular transfer to new media every 3 to 5 years. Current “off the shelf” technology for DNA synthesis equates to a price of around 100 bytes per U.S. dollar. Newer technology commercially available from Agilent Technologies (Santa Clara, Calif.) may substantially decrease this cost. However, account must also be taken of the regular transfer of data between tape media. The questions are both what this transfer of data costs and whether this cost is fixed or diminishes over time. If a substantial amount of the cost is assumed to be fixed, then there is a time horizon at which use of DNA molecules for data storage is more cost effective than regular data storage on the tape media. After 400 years (at least 80 media transfers), it is possible that this data storage using DNA molecules is already cost effective.

The high capacity of DNA to store information stably under easily achieved conditions has made DNA an attractive target for information storage since 1995. In addition to its information density, the DNA molecule has a proven track record as an information carrier, its longevity is known and, because DNA is a basis of life on Earth, methods for manipulating, storing and reading the DNA molecule will remain the subject of continual technological innovation while there remains DNA-based intelligent life. Data storage systems based on both living vector DNA (in vivo DNA molecules) and on synthesized DNA (in vitro DNA) have been proposed. The in vivo data storage systems have several disadvantages. Such disadvantages include constraints on the quantity, genomic elements and locations that can be manipulated without affecting viability of the DNA molecules in the living vector organisms. Examples of such living vector organisms include but are not limited to bacteria. These constraints decrease capacity and increase the complexity of information encoding schemes. Furthermore, germline and somatic mutation will cause the fidelity of the stored information and decoded information to be reduced over time, and may require the storage conditions of the living DNA to be carefully regulated.

In contrast, the “isolated DNA” (i.e., in vitro DNA) is more easily “written”, and routine recovery of examples of the non-living DNA from samples that are tens of thousands of years old indicates that a well-prepared non-living DNA sample should have an exceptionally long lifespan in easily achieved low-maintenance environments (i.e. cold, dry and dark environments). See, Shapiro, B. et al. Rise and fall of the Beringian steppe bison. Science 306, 1561-1565 (2004); Poinar, H. K. et al. Metagenomics to paleogenomics: large-scale sequencing of mammoth DNA. Science 311, 392-394 (2005); Willerslev, E. et al. Ancient biomolecules from deep ice cores reveal a forested southern Greenland. Science 317, 111-114 (2007); Green, R. E. et al. A draft sequence of the Neanderthal genome. Science 328, 710-722 (2010); Anchordoquy, T. J. & Molina, M. C. Preservation of DNA. Cell Preservation Tech. 5, 180-188 (2007); Bonnet, J. et al. Chain and conformation stability of solid-state DNA: implications for room temperature storage. Nucl. Acids Res. 38, 1531-1546 (2010); Lee, S. B., Crouse, C. A. & Kline, M. C. Optimizing storage and handling of DNA extracts. Forensic Sci. Rev. 22, 131-144 (2010).

Previous work on the storage of information (also termed data) in the DNA has typically focused on “writing” a human-readable message into the DNA in encoded form, and then “reading” the encoded human-readable message by determining the sequence of the DNA and decoding the sequence. Work in the field of DNA computing has given rise to schemes that in principle permit large-scale associative (content-addressed) memory, but there have been no attempts to develop this work as practical DNA-storage schemes. Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583-585 (1995); Tsaftaris, S. A. & Katsaggelos, A. K. On designing DNA databases for the storage and retrieval of digital signals. Lecture Notes Comp. Sci. 3611, 1192-1201 (2005); Yamamoto, M., Kashiwamura, S., Ohuchi, A. & Furukawa, M. Large-scale DNA memory based on the nested PCR. Natural Computing 7, 335-346 (2008); Kari, L. & Mahalingam, K. DNA computing: a research snapshot. In Atallah, M. J. & Blanton, M. (eds.) Algorithms and Theory of Computation Handbook, vol. 2. 2nd ed. pp. 31-1-31-24 (Chapman & Hall, 2009). FIG. 1 shows the amounts of information successfully encoded and recovered in 14 previous studies (note the logarithmic scale on the y-axis). Points are shown for 14 previous experiments (open circles) and for the present disclosure (solid circle). The largest amount of human-readable messages stored this way is 1280 characters of English language text⁸, equivalent to approximately 6500 bits of Shannon information. Gibson, D. G. et al. Creation of a bacterial cell controlled by a chemically synthesized genome. Science 329, 52-56 (2010); MacKay, D. J. C. Information Theory, Inference, and Learning Algorithms. (Cambridge University Press, 2003).

The Indian Council of Scientific and Industrial Research has filed U.S. Patent Application Publication No. 2005/0053968 (Bharadwaj et al), which teaches a method for storing information in DNA. The method of U.S. '968 comprises using an encoding method that uses four DNA bases to represent each character of an extended ASCII character set. A synthetic DNA molecule is then produced, which includes the digital information and an encryption key, and is flanked on each side by a primer sequence. Finally, the synthesized DNA is incorporated in a storage DNA. In the event that the amount of DNA is too large, the information can be fragmented into a number of segments. The method disclosed in U.S. '968 is able to reconstruct the fragmented DNA segments by matching up the header primer of one of the segments with the tail primer on the subsequent one of the segments.

Other patent publications are known which describe techniques for storing information in DNA. For example, U.S. Pat. No. 6,312,911 teaches a steganographic method for concealing coded messages in DNA. The method comprises concealing a DNA encoded message within a genomic DNA sample followed by further concealment of the DNA sample to a microdot. The application of this U.S. '911 patent is in particular for the concealment of confidential information. Such information is generally of limited length and thus the document does not discuss how to store items of information that are of longer length. The same inventors have filed an International Patent Application published as International Publication No. WO 03/025123.

SUMMARY OF THE INVENTION

A practical encoding-decoding procedure that stores more information than previously handled is described in this disclosure. The inventors have encoded five computer files, totaling 757051 bytes (739 kB) of hard disk storage and with an estimated Shannon information of 5.2×10⁶ bits, into a DNA code. The inventors subsequently synthesized this DNA, transported the synthesized DNA from the USA to Germany via the UK, sequenced the DNA and reconstructed all five computer files with 100% accuracy.

The five computer files included an English language text (all 154 of Shakespeare's sonnets), a PDF document of a classic scientific paper, a JPEG colour photograph and an MP3 format audio file containing 26 seconds of speech (from Martin Luther King's “I Have A Dream” speech). Watson, J. D. & Crick, F. H. C. Molecular structure of nucleic acids. Nature 171, 737-738 (1953). This data storage represents approximately 800 times as much information as the known previous DNA-based storage and covers a much greater variety of digital formats. The results demonstrate that DNA storage is increasingly realistic and could, in future, provide a cost-effective means of archiving digital information and may already be cost effective for low access, multi-decade archiving tasks.

A method for storage of an item of information is disclosed. The method comprises encoding bytes in the item of information. Using a schema, the encoded bytes are represented by DNA nucleotides to produce a DNA sequence in silico. In a next step, the DNA sequence is split into a plurality of overlapping DNA segments and indexing information is added to the plurality of DNA segments. Finally, the plurality of DNA segments is synthesized and stored.

The addition of the indexing information to the DNA segments means that the position of the segments in the DNA sequence representing the item of information can be uniquely identified. There is no need to rely on a matching of a head primer with a tail primer. This makes it possible to recover almost the entire item of information, even if one of the segments has failed to reproduce correctly. If no indexing information were present, then there is a risk that it might not be possible to correctly reproduce the entire item of information if the segments could not be matched to each other due to “orphan” segments whose position in the DNA sequence cannot be clearly identified.

The use of overlapping DNA segments means that a degree of redundancy is built into the storage of the items of information. If one of the DNA segments cannot be decoded, then the encoded bytes can still be recovered from neighboring ones of the DNA segments. Redundancy is therefore built into the system.

Multiple copies of the DNA segments can be made using known DNA synthesis techniques. This provides an additional degree of redundancy to enable the item of information to be decoded, even if some of the copies of the DNA segments are corrupted and cannot be decoded.

In one aspect of the invention, the representation schema used for encoding is designed such that adjacent ones of the DNA nucleotides are different. This is to increase the reliability of the synthesis, reproduction and sequencing (reading) of the DNA segments.

In a further aspect of the invention, a parity-check is added to the indexing information. This parity check enables erroneous synthesis, reproduction or sequencing of the DNA segments to be identified. The parity-check can be expanded to also include error correction information.

Alternate ones of the synthesized DNA segments are reverse complemented. This provides an additional degree of redundancy in the DNA and means that there is more information available if any of the DNA segments is corrupted.

DESCRIPTION OF THE FIGURES

For a more complete understanding of the present invention and the advantages thereof, reference is now made to the following description and the accompanying drawings, in which:

FIG. 1 is a graph of amounts of information stored in DNA and successfully recovered, as a function of time.

FIG. 2 shows an example of the method of the present disclosure. FIG. 2 discloses SEQ ID NO: 4.

FIG. 3 shows a graph of the cost effectiveness of storage over time.

FIG. 4 shows a motif with a self-reverse complementary pattern.

FIG. 5 shows the encoding efficiency.

FIG. 6 shows error rates.

FIG. 7 shows a flow diagram of the encoding of the method.

FIG. 8 shows a flow diagram of the decoding of the method.

DETAILED DESCRIPTION

One of the main challenges for a practical implementation of DNA storage to date has been the difficulty of creating long sequences of DNA to a specified design. The long sequences of DNA are required to store large data files, such as long text items and videos. It is also preferable to use an encoding with a plurality of copies of each designed DNA. Such redundancy guards against both encoding and decoding errors, as will be explained below. It is not cost-efficient to use a system based on individual long DNA chains to encode each (potentially large) message. The inventors have developed a method that uses ‘indexing’ information associated with each one of the DNA segments to indicate the position of the DNA segment in a hypothetical longer DNA molecule that encodes the entire message.

The inventors used methods from code theory to enhance the recoverability of the encoded messages from the DNA segment, including forbidding DNA homopolymers (i.e. runs of more than one identical base) that are known to be associated with higher error rates in existing high throughput technologies. The inventors further incorporated a simple error-detecting component, analogous to a parity-check bit⁹, into the indexing information in the code. More complex schemes, including but not limited to error-correcting codes and, indeed, substantially any form of digital data security (e.g. RAID-based schemes) currently employed in informatics, could be implemented in future developments of the DNA storage scheme. See, Baum, E. B. Building an associative memory vastly larger than the brain. Science 268, 583-585 (1995) and Chen, P. M., Lee, E. K., Gibson, G. A., Katz, R. H. & Patterson, D. A. RAID: high-performance, reliable secondary storage. ACM Computing Surveys 26, 145-185 (1994).

The inventors selected five computer files to be encoded as a proof-of-concept for the DNA storage of this disclosure. Rather than restricting the files to human-readable information, files using a range of common formats were chosen. This demonstrated the ability of the teachings of the disclosure to store arbitrary types of digital information. The files contained all 154 of Shakespeare's sonnets (in TXT format), the complete text and figure of Watson and Crick's 1953 publication (ref 10, in PDF format), a medium-resolution color photograph of the EMBL-European Bioinformatics Institute (JPEG 2000 format), a 26 second extract from Martin Luther King's “I Have A Dream” speech (MP3 format) and a file defining the Huffman code used in this study to convert bytes to base-3 digits (as a human-readable text file).

The five files selected for DNA-storage were as follows:

-   wssnt10.txt (107738 bytes, ASCII text format): all 154 Shakespeare sonnets (from Project Gutenberg, http://www.gutenberg.org/ebooks/1041).
-   watsoncrick.pdf (280864 bytes, PDF format document): Watson and Crick's (1953) publication¹⁰ describing the structure of DNA (from the Nature website, http://www.nature.com/nature/dna50/archive.html, modified to achieve higher compression and thus smaller file size).
-   EBI.jp2 (184264 bytes, JPEG 2000 format image file): color photograph (16.7M colors, 640×480 pixel resolution) of the EMBL-European Bioinformatics Institute (own picture).
-   MLK_excerpt_VBR_45-85.mp3 (168539 bytes, MP3 format sound file): 26 second-long extract from Martin Luther King's “I Have A Dream” speech (from http://www.americanrhetoric.com/speeches/mlkihaveadream.htm, modified to achieve higher compression: variable bit rate, typically 48-56 kbps; sampling frequency 44.1 kHz).
-   View huff3.cd.new (15646 bytes, ASCII file): human-readable file defining the Huffman code used in this study to convert bytes to base-3 digits (trits).

The five computer files comprise a total of 757051 bytes, approximately equivalent to a Shannon information of 5.2×10⁶ bits or 800 times as much encoded and recovered human-designed information as the previous maximum amount known to have been stored (see FIG. 1).

The DNA encoding of each one of the computer files was computed using software and the method is illustrated in FIG. 7. In one aspect of the invention 700 described herein, the bytes comprising each computer file 210 were represented in step 720 as a DNA sequence 230 with no homopolymers by an encoding scheme to produce an encoded file 220 that replaces each byte by five or six bases (see below) forming the DNA sequence 230. The code used in the encoding scheme was constructed to permit a straightforward encoding that is close to the optimum information capacity for a run length-limited channel (i.e., no repeated nucleotides). It will, however, be appreciated that other encoding schemes may be used.

The resulting in silico DNA sequences 230 are too long to be readily produced by standard oligonucleotide synthesis. Each of the DNA sequences 230 was therefore split in step 730 into overlapping segments 240 of length 100 bases with an overlap of 75 bases. To reduce the risk of systematic synthesis errors introduced to any particular run of bases, alternate ones of the segments were then converted in step 740 to their reverse complement, meaning that each base is “written” four times, twice in each direction. Each segment was then augmented in step 750 with an indexing information 250 that permitted determination of the computer file from which the segment 240 originated and its location within that computer file 210, plus simple error-detection information. This indexing information 250 was also encoded in step 760 as non-repeating DNA nucleotides, and appended in step 770 to the 100 information storage bases of the DNA segments 240. It will be appreciated that the division of the DNA segments 240 into lengths of 100 bases with an overlap of 75 bases is purely arbitrary. It would be possible for other lengths and overlaps to be used and this is not limiting of the invention.
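By way of non-limiting illustration, the splitting of step 730 and the reverse complementing of step 740 might be sketched in Python as follows. The function names are hypothetical and do not appear in the disclosure. The sketch assumes that the sequence is at least 100 bases long, that its length is a multiple of the 25-base step (any padding rule is outside this sketch), and that it is the odd-numbered segments that are reverse complemented.

```python
# Illustrative sketch only; names and the padding assumption are not from the disclosure.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def split_into_segments(seq: str, length: int = 100, overlap: int = 75) -> list:
    """Split seq into overlapping segments, reverse complementing alternate ones."""
    step = length - overlap  # 25 bases with the values described above
    segments = []
    for number, start in enumerate(range(0, len(seq) - length + 1, step)):
        segment = seq[start:start + length]
        if number % 2 == 1:  # assumption: odd-numbered segments are flipped
            segment = reverse_complement(segment)
        segments.append(segment)
    return segments
```

With a 25-base step and 100-base segments, a base away from the ends of the sequence falls inside four consecutive segments, two of them in each orientation, which corresponds to each base being “written” four times.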

In total, all of the five computer files were represented by 153335 strings of DNA. Each one of the strings of DNA comprised 117 nucleotides (encoding original digital information plus indexing information). The encoding scheme gave the synthesized DNA various features (e.g. uniform segment lengths, absence of homopolymers) that made it obvious that the synthesized DNA did not have a natural (biological) origin. It is therefore obvious that the synthesized DNA has a deliberate design and encoded information. See, Cox, J. P. L. Long-term data storage in DNA. TRENDS Biotech. 19, 247-250 (2001).

As noted above, other encoding schemes for the DNA segments 240 could be used, for example to provide enhanced error-correcting properties. It would also be straightforward to increase the amount of indexing information in order to allow more or larger files to be encoded. It has been suggested that the Nested Primer Molecular Memory (NPMM) scheme reaches its practical maximum capacity at 16.8M unique addresses, and there appears to be no reason why the method of the disclosure could not be extended beyond this to enable the encoding of almost arbitrarily large amounts of information. See, Yamamoto, M., Kashiwamura, S., Ohuchi, A. & Furukawa, M. Large-scale DNA memory based on the nested PCR. Natural Computing 7, 335-346 (2008) and Kari, L. & Mahalingam, K. DNA computing: a research snapshot. In Atallah, M. J. & Blanton, M. (eds.) Algorithms and Theory of Computation Handbook, vol. 2. 2nd ed. pp. 31-1-31-24 (Chapman & Hall, 2009).

One extension to the coding scheme, in order to avoid systematic patterns in the DNA segments 240, would be to change the information. Two ways of doing this were tried. A first way involved the “shuffling” of information in the DNA segments 240; the information can be retrieved if one knows the pattern of shuffling. In one aspect of the disclosure, different patterns of shuffles were used for different ones of the DNA segments 240.

A further way is to add a degree of randomness into the information in each one of the DNA segments 240. A series of random digits can be used for this, using modular addition of the series of random digits and the digits comprising the information encoded in the DNA segments 240. The information can easily be retrieved by modular subtraction during decoding if one knows the series of random digits used. In one aspect of the disclosure, different series of random digits were used for different ones of the DNA segments 240.
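The modular addition and subtraction described above might, purely as an illustration, be sketched in Python as follows. The disclosure does not state how the series of random digits is generated. Here a seeded pseudo-random generator stands in for that series, the digits are taken to be base-3 digits (trits), and the function names and the per-segment `seed` parameter are hypothetical.

```python
# Illustrative sketch only; the generator, the modulus of 3 and all names are assumptions.
import random


def mask_trits(trits, seed):
    """Add a reproducible series of random trits to the data trits, modulo 3."""
    rng = random.Random(seed)
    return [(t + rng.randrange(3)) % 3 for t in trits]


def unmask_trits(masked, seed):
    """Recover the data trits by subtracting the same series, modulo 3."""
    rng = random.Random(seed)
    return [(m - rng.randrange(3)) % 3 for m in masked]
```

Using a different `seed` for each DNA segment would correspond to using different series of random digits for different ones of the DNA segments.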

The digital information encoding in step 720 was carried out as follows. The five computer files 210 of digital information (represented in FIG. 2A) stored on a hard-disk drive were encoded using software. Each byte of each one of the five computer files 210 to be encoded in step 720 was represented as a sequence of DNA bases via base-3 digits (‘trits’ 0, 1 and 2) using a purpose-designed Huffman code listed in Table 1 (below) to produce the encoded file 220. This exemplary coding scheme is shown in outline in FIG. 2B. Each of the 256 possible bytes was represented by five or six trits. Subsequently, each one of the trits was encoded as a DNA nucleotide 230 selected from the three nucleotides different from the previous nucleotide (FIG. 2C). In other words, in the encoding scheme chosen for this aspect of the disclosure, each nucleotide was chosen from the three nucleotides that differ from the previous one used, to ensure no homopolymers. The resulting DNA sequence 230 was split in step 730 into DNA segments 240 of length 100 bases, as shown in FIG. 2D. Each one of the DNA segments overlapped the previous DNA segment by 75 bases, to give DNA segments of a length that was readily synthesized and to provide redundancy. Alternate ones of the DNA segments were reverse complemented.
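A minimal Python sketch of the trit-to-nucleotide step is given below for illustration. The actual assignment of trits to nucleotides is that of FIG. 2C, which is not reproduced in this text. The sketch therefore assumes that the three candidate nucleotides are taken in alphabetical order and that the nucleotide preceding the first trit is “A”. Both choices, and the function names, are hypothetical.

```python
# Illustrative sketch only; the trit-to-base ordering and the start base are assumptions.
BASES = "ACGT"


def trits_to_dna(trits, previous="A"):
    """Encode each trit as one of the three bases that differ from the previous base."""
    out = []
    for trit in trits:
        candidates = [b for b in BASES if b != previous]  # always three choices
        previous = candidates[trit]
        out.append(previous)
    return "".join(out)


def dna_to_trits(dna, previous="A"):
    """Invert trits_to_dna under the same assumed ordering and start base."""
    trits = []
    for base in dna:
        candidates = [b for b in BASES if b != previous]
        trits.append(candidates.index(base))
        previous = base
    return trits
```

Because each new base is drawn only from the bases that differ from the one before it, no two adjacent bases in the output can be identical, whatever the input trits are.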

The indexing information 250 comprised two trits for file identification (permitting 3²=9 files to be distinguished, in this implementation), 12 trits for intra-file location information (permitting 3¹²=531441 locations per file) and one ‘parity-check’ trit. The indexing information 250 was encoded in step 760 as non-repeating DNA nucleotides and was appended in step 770 to the 100 information storage bases. Each indexed DNA segment 240 had one further base added in step 780 at each end, consistent with the ‘no homopolymers’ rule, that would indicate whether the entire DNA segment 240 was reverse complemented during the ‘reading’ stage of the experiment.
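The layout of the indexing information might be sketched in Python as follows, by way of example only. The rule used to compute the parity-check trit is not stated in this passage. The sketch assumes, hypothetically, that it is the sum of the 14 preceding trits modulo 3. The function names are likewise illustrative.

```python
# Illustrative sketch only; the parity rule and all names are assumptions.
def to_trits(value, width):
    """Write a non-negative integer as exactly `width` base-3 digits, most significant first."""
    digits = []
    for _ in range(width):
        value, remainder = divmod(value, 3)
        digits.append(remainder)
    if value:
        raise ValueError("value does not fit in the given number of trits")
    return digits[::-1]


def build_index(file_id, location):
    """Return 2 file-identification trits, 12 location trits and 1 parity-check trit."""
    trits = to_trits(file_id, 2) + to_trits(location, 12)
    parity = sum(trits) % 3  # assumed parity rule
    return trits + [parity]


def index_is_consistent(index_trits):
    """Check the assumed parity rule on a 15-trit index."""
    return len(index_trits) == 15 and sum(index_trits[:14]) % 3 == index_trits[14]
```

Under this layout, `file_id` can range from 0 to 8 and `location` from 0 to 531440, and a single altered trit changes the sum modulo 3 and so fails the check.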

In total, the five computer files 210 were represented by 153335 strings of DNA, each comprising 117 (1+100+2+12+1+1) nucleotides (encoding original digital information and indexing information).

The data-encoding component of each string in the aspect of the invention described herein uses an average of 5.07 DNA bases per byte, which is close to the theoretical optimum of 5.05 DNA bases per byte for base-4 channels with run length limited to one. The indexing implementation 250 permits 3¹⁴=4782969 unique data locations. Increasing the number of indexing trits (and therefore bases) used to specify file and intra-file location by just two, to 16, gives 3¹⁶=43046721 unique locations, in excess of the 16.8M that is the practical maximum for the NPMM scheme. See, Yamamoto, M., Kashiwamura, S., Ohuchi, A. & Furukawa, M. Large-scale DNA memory based on the nested PCR. Natural Computing 7, 335-346 (2008) and Kari, L. & Mahalingam, K. DNA computing: a research snapshot. In Atallah, M. J. & Blanton, M. (eds.) Algorithms and Theory of Computation Handbook, vol. 2. 2nd ed. pp. 31-1-31-24 (Chapman & Hall, 2009).

The DNA synthesis process of step 790 was also used to incorporate 33 bp adapters to each end of each one of the oligonucleotides (oligo) to facilitate sequencing on Illumina sequencing platforms:

5′ adapter: (SEQ ID NO: 1) ACACTCTTTCCCTACACGACGCTCTTCCGATCT
3′ adapter: (SEQ ID NO: 2) AGATCGGAAGAGCGGTTCAGCAGGAATGCCGAG

The 153335 DNA segment designs 240 were synthesized in step 790 in three distinct runs (with the DNA segments 240 randomly assigned to runs) using an updated version of Agilent Technologies' OLS (Oligo Library Synthesis) process described previously^(22, 23) to create approx. 1.2×10⁷ copies of each DNA segment design. Errors were seen to occur at a rate of only about one per 500 bases and independently in different copies of the DNA segments 240. Agilent Technologies adapted the phosphoramidite chemistry developed previously²⁴ and employed inkjet printing and flow cell reactor technologies in Agilent's SurePrint in situ microarray synthesis platform. The inkjet printing within an anhydrous chamber allows the delivery of very small volumes of phosphoramidites to a confined coupling area on a 2D planar surface, resulting in the addition of hundreds of thousands of bases in parallel. Subsequent oxidation and detritylation are carried out in a flow cell reactor. Once the DNA synthesis has been completed, the oligonucleotides are then cleaved from the surface and deprotected. See, Cleary, M. A. et al. Production of complex nucleic acid libraries using highly parallel in situ oligonucleotide synthesis. Nature Methods 1, 241-248 (2004).

The adapters were added to the DNA segments to enable a plurality of copies of the DNA segments to be easily made. A DNA segment with no adapter would require additional chemical processes to “kick start” the chemistry for the synthesis of the multiple copies by adding additional groups onto the ends of the DNA segments.

Up to ˜99.8% coupling efficiency is achieved by using thousands-fold excess of phosphoramidite and activator solution. Similarly, millions-fold excess of detritylation agent drives the removal of the 5′-hydroxyl protecting group to near completion. A controlled process in the flowcell reactor significantly reduced depurination, which is the most prevalent side reaction. See, Le Proust, E. M. et al. Synthesis of high-quality libraries of long (150 mer) oligonucleotides by a novel depurination controlled process. Nucl. Acids Res. 38, 2522-2540 (2010). Up to 244000 unique sequences can be synthesized in parallel and delivered as ˜1-10 picomole pools of oligos.

The three samples of lyophilized oligos were incubated in Tris buffer overnight at 4° C., periodically mixed by pipette and vortexing, and finally incubated at 50° C. for 1 hour, to a concentration of 5 ng/ml. As insolubilized material remained, the samples were left for a further 5 days at 4° C. with mixing two to four times each day. The samples were then incubated at 50° C. for 1 hour and 68° C. for 10 minutes, and purified from residual synthesis by-products on Ampure XP paramagnetic beads (Beckman Coulter), after which they could be stored in step 795. Sequencing and decoding is shown in FIG. 8.

The combined oligo sample was amplified in step 810 (22 PCR cycles using thermocycler conditions designed to give even A/T vs. G/C processing²⁶) using paired-end Illumina PCR primers and high-fidelity AccuPrime reagents (Invitrogen), a combination of Taq and Pyrococcus polymerases with a thermostable accessory protein. The amplified products were bead purified and quantified on an Agilent 2100 Bioanalyzer, and sequenced using AYB software in paired-end mode on an Illumina HiSeq 2000 to produce reads of 104 bases.

The digital information decoding was carried out as follows. The central 91 bases of each oligo were sequenced in step 820 from both ends and so rapid computation of full-length (117 base) oligos and removal of sequence reads inconsistent with the designs was straightforward. The sequence reads were decoded in step 830 using computer software that exactly reverses the encoding process. The sequence reads for which the parity-check trit indicated an error or that at any stage could not be unambiguously decoded or assigned to a reconstructed computer file were discarded in step 840 from further consideration.
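The figure of 91 bases follows from the read and oligo lengths: two reads of 104 bases taken from opposite ends of a 117-base oligo share 2×104−117=91 central bases. A hedged Python sketch of how a full-length oligo might be computed from one read pair is given below. It assumes error-free reads that each begin at an end of the 117-base design, with the second read reported on the opposite strand. The software actually used is not described here, and the names are hypothetical.

```python
# Illustrative sketch only; exact-match merging and all names are assumptions.
READ_LENGTH = 104
OLIGO_LENGTH = 117
OVERLAP = 2 * READ_LENGTH - OLIGO_LENGTH  # the central 91 bases read from both ends

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_read_pair(read1: str, read2: str):
    """Return the 117-base oligo implied by a read pair, or None if the reads disagree."""
    if len(read1) != READ_LENGTH or len(read2) != READ_LENGTH:
        return None
    second = reverse_complement(read2)  # bring the second read onto the same strand
    if read1[-OVERLAP:] != second[:OVERLAP]:
        return None  # inconsistent pair, discarded
    return read1 + second[OVERLAP:]
```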

The vast majority of locations within every decoded file were detected in multiple different sequenced DNA oligos, and simple majority voting in step 850 was used to resolve any discrepancies caused by the DNA synthesis or the sequencing errors. On completion of this procedure 860, four of the five original computer files 210 were reconstructed perfectly. The fifth computer file required manual intervention to correct two regions each of 25 bases that were not recovered from any sequenced read.
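The majority voting of step 850 might be sketched in Python as follows, as an illustration only. The input is assumed to be a collection of (location, base) observations gathered from all decoded oligos covering each location. The function name and this data layout are hypothetical, and the handling of ties (here, the base seen first wins) is an assumption, since ties are not discussed above.

```python
# Illustrative sketch only; the data layout, names and tie handling are assumptions.
from collections import Counter


def majority_consensus(observations):
    """Map each location to the base observed most often at that location."""
    votes = {}
    for location, base in observations:
        votes.setdefault(location, Counter())[base] += 1
    return {location: counts.most_common(1)[0][0] for location, counts in votes.items()}
```

A location that receives no observations at all is simply absent from the result, which is how a region not recovered from any sequenced read would show up in this sketch.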

During decoding in step 850, it was noticed that one file (ultimately determined to be watsoncrick.pdf) reconstructed in silico at the level of DNA (prior to decoding, via base-3, to bytes) contained two regions of 25 bases that were not recovered from any one of the sequenced oligos. Given the overlapping segment structure of the encoding, each region indicated the failure of four consecutive segments to be synthesized or sequenced, as any one of four consecutive overlapping segments would have contained bases corresponding to this location. Inspection of the two regions indicated that the non-detected bases fell within long repeats of the following 20-base motif:

(SEQ ID NO: 3) 5′ GAGCATCTGCAGATGCTCAT 3′

It was noticed that repeats of this motif have a self-reverse complementary pattern. These are shown in FIG. 4.

It is possible that long, self-reverse complementary DNA segments might not be readily sequenced using the Illumina paired-end process, owing to the possibility that the DNA segments might form internal nonlinear stem-loop structures that would inhibit the sequencing-by-synthesis reaction of the protocol used in the method described in this document. Consequently, the in silico DNA sequence was modified to repair the repeating motif pattern and then subjected to subsequent decoding steps. No further problems were encountered, and the final decoded file matched perfectly the file watsoncrick.pdf. A code that ensured that no long self-complementary regions existed in any of the designed DNA segments could be used in future.
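The self-reverse complementary property of repeats of SEQ ID NO: 3 can be checked with a short Python sketch, offered here as an illustration rather than as part of the method. A tandem repeat of a motif reads the same on both strands, up to a shift, exactly when the reverse complement of the motif is a cyclic rotation of the motif. The function names are hypothetical.

```python
# Illustrative sketch only; names are not from the disclosure.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def repeats_are_self_reverse_complementary(motif: str) -> bool:
    """True if the reverse complement of the motif is a cyclic rotation of the motif."""
    return reverse_complement(motif) in motif + motif


MOTIF = "GAGCATCTGCAGATGCTCAT"  # SEQ ID NO: 3
print(repeats_are_self_reverse_complementary(MOTIF))  # True
```

The reverse complement of the motif is ATGAGCATCTGCAGATGCTC, which is the motif rotated by two bases. A test of this kind could be one ingredient of a future code that avoids long self-complementary regions, although no such code is specified here.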

Example of Huffman Coding Scheme

Table 1 shows an example of the exemplary Huffman coding scheme used to convert byte values (0-255) to base-3. For highly compressed information, each byte value should appear equally frequently and the mean number of trits per byte will be (239*5+17*6)/256=5.07. The theoretical minimum number of trits per byte is log(256)/log(3)=5.05.
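The two figures quoted above can be reproduced with a few lines of arithmetic. This sketch only evaluates the expressions as stated in the text; the variable names are illustrative.

```python
import math

# Mean code length as stated: 239 code words of 5 trits, 17 of 6 trits,
# with all 256 byte values equally likely.
mean_trits_per_byte = (239 * 5 + 17 * 6) / 256

# The log(256)/log(3) figure: the number of base-3 digits that carry the
# same information as one 8-bit byte.
trits_per_byte_bound = math.log(256) / math.log(3)
```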

TABLE 1
Base 3 Coding (code words are 5 or 6 trits)

No   Byte Value  Code Word  8-bit ASCII Character
0    0           22201
1    85          22200      U
2    170         22122      ™
3    127         22121
4    253         22120      ”
5    52          22112      4
6    138         22111      ä
7    41          22110      )
8    86          22102      V
9    42          22101      *
10   100         22100      d
11   44          22022      ,
12   250         22020      {dot over ( )}
13   132         22021      Ñ
14   161         22012      °
15   98          22010      b
16   8           22002
17   34          22011      ″
18   10          22001      [NL]
19   149         22000      ï
20   87          21222      W
21   21          21221
22   74          21220      J
23   36          21212      $
24   69          21210      E
25   177         21202      ±
26   20          21211
27   213         21200      '
28   163         21201      £
29   229         21121      Â
30   255         21122      {hacek over ( )}
31   197         21120      ≈
32   133         21112      Ö
33   252         21110      ,
34   26          21111
35   173         21101      ≠
36   151         21102      ó
37   82          21100      R
38   75          21022      K
39   37          21021      %
40   166         21011      ¶
41   191         21020      ø
42   88          21012      X
43   63          21010      ?
44   68          21001      D
45   150         21002      ñ
46   76          21000      L
47   4           20222
48   154         20221      ö
49   234         20212      Í
50   22          20220
51   162         20211      ¢
52   105         20210      i
53   102         20202      f
54   171         20201      {acute over ( )}
55   104         20200      h
56   169         20122      ©
57   196         20121      f
58   208         20120      —
59   84          20112      T
60   130         20111      ç
61   146         20102      í
62   72          20110      H
63   16          20101
64   66          20100      B
65   24          20022
66   106         20012      j
67   223         20020      fl
68   58          20021      :
69   137         20011      â
70   73          20010      I
71   101         20001      e
72   168         20002      ®
73   181         12221      μ
74   175         12222      Ø
75   251         20000      °
76   40          12220      (
77   140         12212      å
78   17          12211
79   83          12210      S
80   254         12202
81   240         12201
82   214         12200      ÷
83   53          12122      5
84   202         12112
85   25          12121
86   18          12120
87   247         12111      {tilde over ( )}
88   174         12110      Æ
89   112         12102      p
90   89          12101      Y
91   210         12100      “
92   217         12012      Ÿ
93   248         12020
94   194         12021      ¬
95   182         12022      ∂
96   80          12011      P
97   79          12002      O
98   195         12010      √
99   12          12001
100  209         12000      —
101  165         11222      •
102  245         11221      1
103  2           11220
104  81          11212      Q
105  38          11211      &
106  141         11202      ç
107  211         11210      ”
108  239         11200      Ô
109  95          11201      −
110  43          11122      +
111  224         11121      ‡
112  203         11112      À
113  145         11120      ë
114  147         11110      ì
115  19          11111
116  50          11101      2
117  136         11102      à
118  107         11100      k
119  134         11022      Ü
120  109         11021      m
121  153         11020      ô
122  148         11002      î
123  205         11010      Õ
124  212         11011      ‘
125  54          11012      6
126  241         11000      Ò
127  156         11001      ú
128  115         10222      s
129  116         10221      t
130  78          10220      N
131  67          10211      C
132  70          10212      F
133  178         10210      ≤
134  159         10202      ü
135  142         10201      é
136  92          10200      \
137  48          10122      0
138  90          10120      Z
139  218         10121      /
140  126         10112      ~
141  39          10111      ′
142  219         10102      €
143  167         10110      β
144  114         10101      r
145  172         10022      {umlaut over ( )}
146  14          10100
147  120         10020      x
148  139         10021      ã
149  160         10012      †
150  33          10011      !
151  179         10010      ≥
152  117         10002      u
153  225         10001      •
154  129         10000      Å
155  183         02222      Σ
156  230         02220      Ê
157  35          02221      #
158  93          02210      ]
159  6           02211
160  32          02212
161  56          02201      8
162  158         02202      û
163  185         02121      π
164  47          02122      /
165  143         02200      è
166  123         02111      {
167  204         02120      Ã
168  242         02112      Ú
169  111         02110      o
170  103         02102      g
171  108         02101      l
172  9           02100      [TAB]
173  65          02022      A
174  249         02020      {hacek over ( )}
175  13          02021      [CR]
176  180         02012      ¥
177  226         02001      ,
178  144         02002      ê
179  15          02010
180  57          02011      9
181  128         02000      Ä
182  135         01220      á
183  243         01221      Û
184  190         01222      æ
185  207         01212      œ
186  77          01211      M
187  45          01210      -
188  91          01202      [
189  192         01201      ¿
190  186         01122      ∫
191  216         01200      ÿ
192  97          01112      a
193  118         01120      v
194  246         01121      {circumflex over ( )}
195  215         01111      ⋄
196  51          01102      3
197  206         01110
198  184         01100      Π
199  227         01101      ”
200  233         01022      È
201  237         01021      Ì
202  188         01020      °
203  113         01012      q
204  49          01011      1
205  201         01010      ...
206  155         01002      õ
207  222         01000      fi
208  231         01001      Á
209  5           00222
210  27          00221
211  131         00212      É
212  164         00220      §
213  3           00211
214  46          00210      ·
215  119         00201      w
216  28          00202
217  176         00200      ∞
218  23          00122
219  64          00121      @
220  157         00120      ù
221  187         00112      ^(a)
222  244         00110      Ù
223  238         00111      Ó
224  96          00102      {grave over ( )}
225  235         00101      Î
226  60          00022      <
227  1           00100
228  110         00021      n
229  200         00011      »
230  221         00020
231  99          00012      c
232  31          00010
233  198         00002      Δ
234  193         00001
235  125         00000      }
236  124         22222      |
237  152         22222      ò
238  122         22222      z
239  71          222212     G
240  94          222211     {circumflex over ( )}
241  220         222210
242  29          222202
243  199         222201     «
244  61          222200     =
245  11          222122
246  228         222121     ‰
247  62          222120     >
248  55          222112     7
249  121         222111     y
250  7           222110
251  30          222102     -
252  232         222101     Ë
253  189         222100     Ω
254  59          222021     ;
255  236         222022     Ï

Encoding of the File

The arbitrary computer file 210 is represented as a string S₀ of bytes (often interpreted as a number between 0 and 2⁸−1, i.e. a value in the set {0 . . . 255}). The string S₀ is encoded using the Huffman code and converted to base-3. This generates a string S₁ of characters, each of which is a trit in {0, 1, 2}.

Let us now write len( ) for the function that computes the length (in characters) of a string, and define n=len(S₁). Represent n in base-3 and prepend 0s to generate a string S₂ of trits such that len(S₂)=20. Form the string concatenation S₄=S₁.S₃.S₂, where S₃ is a string of at most 24 zeros chosen so that len(S₄) is an integer multiple of 25.
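The construction of the padded trit string can be sketched as follows. The function names to_base3 and build_s4 are illustrative assumptions; the 20-trit length field and the padding to a multiple of 25 follow the text above.

```python
def to_base3(n, width):
    """Base-3 representation of n, left-padded with zeros to `width` trits."""
    digits = []
    while n > 0:
        digits.append(str(n % 3))
        n //= 3
    s = "".join(reversed(digits)) or "0"
    if len(s) > width:
        raise ValueError("value does not fit in the requested number of trits")
    return s.zfill(width)


def build_s4(s1):
    """Concatenate S1, the zero padding S3, and the 20-trit length field S2."""
    s2 = to_base3(len(s1), 20)
    # Number of zeros (0..24) needed so the total length is a multiple of 25.
    pad = -(len(s1) + len(s2)) % 25
    s3 = "0" * pad
    return s1 + s3 + s2


s4_short = build_s4("1")      # 1 + 4 padding zeros + 20-trit length field
s4_exact = build_s4("12012")  # 5 + 0 padding zeros + 20-trit length field
```

Placing the length field last lets a decoder read the final 20 trits to recover n and then discard the padding.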

S₄ is converted to the DNA string S₅ of characters in {A, C, G, T} with no repeated nucleotides (nt) using the scheme illustrated in the table below. The first trit of S₄ is coded using the ‘A’ row of the table. For each subsequent trit, characters are taken from the row defined by the previous character conversion.

previous        next trit to encode
nt written      0    1    2
A               C    G    T
C               G    T    A
G               T    A    C
T               A    C    G

Table: Base-3 to DNA encoding ensuring no repeated nucleotides.

For each trit t to be encoded, select the row labeled with the previous nucleotide used and the column labeled t and encode using the nt in the corresponding table cell.
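The table lookup can be expressed compactly in code. In this sketch the dictionary NEXT_NT transcribes the table rows, and the function name trits_to_dna is an illustrative assumption.

```python
# Row = previously written nucleotide; column index = trit to encode.
NEXT_NT = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits, prev="A"):
    """Encode a string of trits as DNA; `prev` selects the starting row."""
    out = []
    for t in trits:
        prev = NEXT_NT[prev][int(t)]  # never equals the previous nucleotide
        out.append(prev)
    return "".join(out)


example = trits_to_dna("0120")        # rows used in turn: A, C, T, G
all_zeros = trits_to_dna("0000000")   # even a constant input avoids repeats
```

Because no row contains its own label, two adjacent nucleotides can never be identical, whatever the input trits are.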

Define N=len(S₅), and let ID be a 2-trit string identifying the original file and unique within a given experiment (permitting mixing of DNA from different files S₀ in one experiment). Split S₅ into the overlapping DNA segments 240 of length 100 nt, each of the DNA segments 240 being offset from the previous one of the DNA segments 240 by 25 nt. This means there will be ((N/25)−3) DNA segments 240, conveniently indexed i=0 . . . (N/25)−4. The DNA segment i is denoted F_(i) and contains (DNA) characters 25i . . . 25i+99 of S₅.
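The splitting into overlapping segments can be sketched as below. The function name split_segments and its parameter names are illustrative assumptions; the defaults of 100 nt and 25 nt are the values given above.

```python
def split_segments(s5, length=100, offset=25):
    """Split s5 into overlapping segments F_0, F_1, ... of `length` nt."""
    n = len(s5)
    if n % offset != 0 or n < length:
        raise ValueError("len(s5) must be a multiple of the offset and >= length")
    # With length=100 and offset=25 this equals (N/25) - 3.
    count = (n - length) // offset + 1
    return [s5[offset * i: offset * i + length] for i in range(count)]


s5 = "ACGT" * 50              # a 200-nt stand-in for an encoded file
segments = split_segments(s5)
```

Each interior 25-nt block of S₅ appears in four consecutive segments, which is the fourfold redundancy relied upon during decoding.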

Each DNA segment F_(i) is further processed as follows:

If i is odd, reverse complement the DNA segment F_(i).

Let i3 be the base-3 representation of i, prepending enough leading zeros so that len(i3)=12. Compute P as the sum (mod 3) of the odd-positioned trits in ID and i3, i.e. ID₁+i3₁+i3₃+i3₅+i3₇+i3₉+i3₁₁. (P acts as a ‘parity trit’, analogous to a parity bit, to check for errors in the encoded information about ID and i.)

Form the indexing information 250 string IX=ID.i3.P (comprising 2+12+1=15 trits). Append the DNA-encoded (step 760) version of IX to F_(i) using the same strategy as shown in the above table, starting with the code table row defined by the last character of F_(i), to give indexed segment F′_(i).

Form F″_(i) by prepending A or T and appending C or G to F′_(i), choosing between A and T, and between C and G, randomly if possible but always such that there are no repeated nucleotides. This ensures that one can distinguish a DNA segment 240 that has been reverse complemented (step 240) during DNA sequencing from one that has not. The former will start with G|C and end with T|A; the latter will start with A|T and end with C|G.
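The per-segment processing steps above can be put together in one sketch. All function names here (to_base3, trits_to_dna, parity_trit, index_segment) are illustrative assumptions, and positions in the parity sum are taken to be 1-based, as the subscripts in the text suggest.

```python
import random

NEXT_NT = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_base3(n, width):
    """Base-3 string of n, zero-padded on the left to `width` trits."""
    digits = []
    while n > 0:
        digits.append(str(n % 3))
        n //= 3
    return ("".join(reversed(digits)) or "0").zfill(width)


def trits_to_dna(trits, prev):
    """Encode trits using the table row selected by the previous nucleotide."""
    out = []
    for t in trits:
        prev = NEXT_NT[prev][int(t)]
        out.append(prev)
    return "".join(out)


def parity_trit(file_id, i3):
    """Sum (mod 3) of ID_1 and the odd-positioned trits i3_1, i3_3, ..., i3_11."""
    return str((int(file_id[0]) + sum(int(t) for t in i3[0::2])) % 3)


def index_segment(f_i, i, file_id, rng=random):
    """Turn a 100-nt segment F_i into the 117-nt oligo F''_i."""
    if i % 2 == 1:
        f_i = f_i.translate(COMPLEMENT)[::-1]       # reverse complement odd i
    i3 = to_base3(i, 12)
    ix = file_id + i3 + parity_trit(file_id, i3)    # 2 + 12 + 1 = 15 trits
    f_prime = f_i + trits_to_dna(ix, prev=f_i[-1])  # 115 nt
    # Orientation nucleotides, chosen randomly where that causes no repeat.
    start = rng.choice([nt for nt in "AT" if nt != f_prime[0]])
    end = rng.choice([nt for nt in "CG" if nt != f_prime[-1]])
    return start + f_prime + end


message = trits_to_dna("0" * 100, prev="A")   # a 100-nt stand-in segment
oligo = index_segment(message, 5, "12")
```

Reverse complementing a sequence without adjacent repeats cannot introduce one, so the no-repeat property holds across the whole 117-nt oligo.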

The segments F″_(i) are synthesized in step 790 as actual DNA oligonucleotides and stored in step 795 and may be supplied for sequencing in step 820.

Decoding

Decoding is simply the reverse of the encoding in step 720, starting with the sequenced DNA segments 240 F″_(i) of length 117 nucleotides. Reverse complementation during the DNA sequencing procedure (e.g. during PCR reactions) can be identified for subsequent reversal by observing whether fragments start with A|T and end with C|G, or start with G|C and end with T|A. With these two ‘orientation’ nucleotides removed, the remaining 115 nucleotides of each DNA segment 240 can be split into the first 100 ‘message’ nucleotides and the remaining fifteen ‘indexing information 250’ nucleotides. The indexing information 250 nucleotides can be decoded to determine the file identifier ID and the position index i3 and hence i, and errors may be detected by testing the parity trit P. Position indexing information 250 permits the reconstruction of the DNA-encoded file 230, which can then be converted to base-3 using the reverse of the encoding table above and then to the original bytes using the given Huffman code.
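A sketch of the first decoding steps for a single sequenced oligo follows. The function names dna_to_trits, orientation and split_fields are illustrative assumptions, and the parity check assumes the same 1-based odd positions used in the encoding description.

```python
NEXT_NT = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}


def dna_to_trits(dna, prev="A"):
    """Invert the base-3 to DNA table; raises ValueError on a repeated nt."""
    trits = []
    for nt in dna:
        trits.append(str(NEXT_NT[prev].index(nt)))
        prev = nt
    return "".join(trits)


def orientation(segment):
    """Classify a 117-nt read by its two orientation nucleotides."""
    first, last = segment[0], segment[-1]
    if first in "AT" and last in "CG":
        return "forward"
    if first in "GC" and last in "TA":
        return "reverse-complemented"
    return "unknown"


def split_fields(segment):
    """Return (message, file ID, index i, parity_ok) for a forward read."""
    core = segment[1:-1]                      # drop orientation nucleotides
    if len(core) != 115:
        raise ValueError("expected 117 nt including orientation nucleotides")
    message, index_nt = core[:100], core[100:]
    index_trits = dna_to_trits(index_nt, prev=message[-1])
    file_id, i3, parity = index_trits[:2], index_trits[2:14], index_trits[14]
    expected = (int(file_id[0]) + sum(int(t) for t in i3[0::2])) % 3
    return message, file_id, int(i3, 3), int(parity) == expected


# Hand-built forward read: file ID "12", index i=5, parity trit 2.
read = "A" + "CGTA" * 25 + "GCGTACGTACGTCAT" + "C"
message, file_id, i, parity_ok = split_fields(read)
```

For odd i the returned message would still need to be reverse complemented again, mirroring the corresponding encoding step, before being placed at offset 25i in the reconstructed sequence.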

Discussion on Data Storage

The DNA storage has different properties from the traditional tape-based storage or disk-based storage. The ~750 kB of information in this example was synthesized in 10 pmol of DNA, giving an information storage density of approximately one Terabyte/gram. The DNA storage requires no power and remains (potentially) viable for thousands of years even by conservative estimates.

DNA Archives can also be copied in a massively parallel manner by the application of PCR to the primer pairs, followed by aliquoting (splitting) the resulting DNA solution. In the practical demonstration of this technology in the sequencing process this procedure was done multiple times, but this could also be used explicitly for copying at large scale the information and then physically sending this information to two or more locations. The storage of the information in multiple locations would provide further robustness to any archiving scheme, and might be useful in itself for very large scale data copying operations between facilities.

The decoding bandwidth in this example was at 3.4 bits/second, compared to disk (approximately one Terabit/second) or tape (140 Megabit/second), and latency is also high (~20 days in this example). It is expected that future sequencing technologies are likely to improve both these factors.

Modeling the full cost of archiving using either the DNA-storage of this disclosure or the tape storage shows that the key parameters are the frequency and fixed costs of transitioning between tape storage technologies and media. FIG. 3 shows the timescales for which DNA-storage is cost-effective. The upper bold curve indicates the break-even time (x-axis) beyond which the DNA storage as taught in this disclosure is less expensive than tape. This assumes that the tape archive has to be read and re-written every 3 years (f=⅓), and depends on the relative cost of DNA-storage synthesis and tape transfer fixed costs (y-axis). The lower bold curve corresponds to tape transfers every 5 years. The region below the lower bold curve indicates cases for which the DNA storage is cost-effective when transfers occur more frequently than every 5 years; between the two bold curves, the DNA storage is cost-effective when transfers occur from 3- to 5-yearly; and above the upper bold curve tape is less expensive when transfers occur less frequently than every 3 years. The dotted horizontal lines indicate ranges of relative costs of DNA synthesis to tape transfer of 125-500 (current values) and 12.5-50 (achieved if DNA synthesis costs reduce by an order of magnitude). Dotted vertical lines indicate corresponding break-even times. Note the logarithmic scales on all axes.

One issue for long-term digital archiving is how DNA-based storage scales to larger applications. The number of bases of the synthesized DNA needed to encode the information grows linearly with the amount of information to be stored. One must also consider the indexing information required to reconstruct full-length files from the short DNA segments 240. The indexing information 250 grows only as the logarithm of the number of DNA segments 240 to be indexed. The total amount of synthesized DNA required grows sub-linearly. Increasingly large parts of each one of the DNA segments 240 are needed for indexing, however, and although it is reasonable to expect synthesis of longer strings to be possible in future, the behavior of the scheme was modeled under the conservative constraint of a constant 114 nucleotides available for both the data and the indexing information 250.
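The logarithmic growth of the position index can be made concrete with a short sketch. The function name index_trits_needed is an illustrative assumption, and the sketch covers only the position index, not the full efficiency model referred to above.

```python
def index_trits_needed(num_segments):
    """Smallest number of trits able to give each segment a distinct index."""
    trits, capacity = 0, 1
    while capacity < num_segments:
        capacity *= 3          # each extra trit triples the address space
        trits += 1
    return trits


# The 12-trit position index i3 described earlier addresses 3**12 segments.
segments_addressable = 3 ** 12
trits_for_that_many = index_trits_needed(segments_addressable)
trits_for_one_more = index_trits_needed(segments_addressable + 1)
```

Tripling the number of segments costs one additional index trit per segment, which is why the index overhead rises slowly as the stored volume grows.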

As the total amount of information increases, the encoding efficiency decreases only slowly (FIG. 5). In the experiment (megabyte scale) the encoding scheme is 88% efficient. FIG. 5 indicates that efficiency remains >70% for data storage on petabyte (PB, 10¹⁵ bytes) scales and >65% on exabyte (EB, 10¹⁸ bytes) scales, and that DNA-based storage remains feasible on scales many orders of magnitude greater than current global data volumes. FIG. 5 also shows that costs (per unit information stored) rise only slowly as data volumes increase over many orders of magnitude. Efficiency and costs scale even more favourably if we consider the lengths of the synthesized DNA segments 240 available using the latest technology. As the amount of information stored increases, decoding requires more oligos to be sequenced. A fixed decoding expenditure per byte of encoded information would mean that each base is read fewer times and so is more likely to suffer decoding error. Extension of the scaling analysis to model the influence of reduced sequencing coverage on the per-decoded-base error rate revealed that error rates increase only very slowly as the amount of information encoded increases to a global data scale and beyond. This also suggests that the mean sequencing coverage of 1,308 times was considerably in excess of that needed for reliable decoding. This was confirmed by subsampling from the 79.6×10⁶ read-pairs to simulate experiments with lower coverage.

FIG. 5 indicates that reducing the coverage by a factor of 10 (or even more) would have led to unaltered decoding characteristics, which further illustrates the robustness of the DNA-storage method. Applications of the DNA-based storage might already be economically viable for long-horizon archives with a low expectation of extensive access, such as government and historical records. An example in a scientific context is CERN's CASTOR system, which stores a total of 80 PB of Large Hadron Collider data and grows at 15 PB yr⁻¹. Only 10% is maintained on disk, and CASTOR migrates regularly between magnetic tape formats. Archives of older data are needed for potential future verification of events, but access rates decrease considerably 2-3 years after collection. Further examples are found in astronomy, medicine and interplanetary exploration.

FIG. 5 shows how the encoding efficiency and costs change as the amount of stored information increases. The x-axis (logarithmic scale) represents the total amount of information to be encoded. Common data scales are indicated, including the three zettabyte (3 ZB, 3×10²¹ bytes) global data estimate. The y-axis scale to the left indicates encoding efficiency, measured as the proportion of synthesized bases available for data encoding. The y-axis scale to the right indicates the corresponding effect on encoding costs, both at current synthesis cost levels (solid line) and in the case of a two-order-of-magnitude reduction (dashed line).

FIG. 6 shows per-recovered-base error rate (y-axis) as a function of sequencing coverage, represented by the percentage of the original 79.6×10⁶ read-pairs sampled (x-axis; logarithmic scale). One curve represents the four files recovered without human intervention: the error is zero when >2% of the original reads are used. Another curve is obtained by Monte Carlo simulation from our theoretical error rate model. The final curve represents the file (watsoncrick.pdf) that required manual correction: the minimum possible error rate is 0.0036%. The boxed area is shown magnified in the inset.

In addition to data storage, the teachings of this disclosure can also be used for steganography.

What is claimed is:
 1. A method of creating a plurality of DNA segments data to be provided to a DNA synthesis platform for controlling the synthesis of a plurality of DNA segments for storing an item of information, the method comprising: encoding bytes in an item of information, stored in a first computer file as a DNA sequence data, using a representation schema to represent the encoded bytes as at least one DNA nucleotide datum in the DNA sequence data; splitting the DNA sequence data into a plurality of overlapping DNA segments data; adding indexing information to the plurality of DNA segments data, the indexing information indicating a position in the DNA sequence data of any one nucleotide datum of any one of the plurality of DNA segments data; and storing the plurality of DNA segments data in a machine-readable second computer file.
 2. The method of claim 1, further including the addition of adapters data to the DNA segments data.
 3. The method of claim 1 using a base-3 scheme for encoding the bytes.
 4. The method of claim 1, wherein the representation schema used is designed such that adjacent ones of the DNA nucleotide data are different.
 5. The method of claim 1, further comprising adding a parity-check to the indexing information.
 6. The method of claim 1, wherein alternate ones of the DNA segments data are reverse complemented.
 7. The method of claim 1, wherein the representation schema used is designed to avoid long, self-reverse complementary DNA segments data.
 8. The method of claim 1, further comprising providing to a DNA synthesis platform the plurality of DNA segments data for controlling the synthesis of a plurality of DNA segments from the DNA segments data.
 9. The method of claim 1, further comprising the step of synthesizing from the DNA segments data a plurality of DNA segments for storing an item of information.
 10. A non-volatile, non-transitory storage medium storing a plurality of DNA segments data to be provided to a DNA synthesis platform for controlling the synthesis of a plurality of DNA segments for storing an item of information, wherein the plurality of DNA segments data are created by a method comprising: encoding bytes in an item of information, stored in a first computer file, as a DNA sequence data using a representation schema to represent the encoded bytes as at least one DNA nucleotide datum in the DNA sequence data; splitting the DNA sequence data into a plurality of overlapping DNA segments data; and adding indexing information to the plurality of DNA segments data, the indexing information indicating a position in the DNA sequence data of any one of the plurality of DNA segments data.
 11. A computer program product comprising logic for executing the method according to claim 1.